Rationale for this change

Parquet files often contain columns with highly repetitive values (e.g., status codes, categories, constant metadata fields). Currently, Arrow reads these into dense arrays, materializing every value and consuming significant memory.
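For intuition, run-end encoding stores each distinct run once, together with the index at which that run ends. A minimal pure-Python sketch of the transformation (illustrative only, not Arrow's implementation):

```python
def run_end_encode(values):
    """Collapse a dense sequence into (run_values, run_ends) lists,
    mirroring the shape of Arrow's run-end-encoded layout."""
    run_values, run_ends = [], []
    for i, v in enumerate(values):
        if not run_values or v != run_values[-1]:
            # A new run starts here; its end index is provisionally i + 1.
            run_values.append(v)
            run_ends.append(i + 1)
        else:
            # Extend the current run by pushing its end index forward.
            run_ends[-1] = i + 1
    return run_values, run_ends

# A column of 1000 identical statuses collapses to a single run.
vals, ends = run_end_encode(["ok"] * 1000 + ["error"] * 2)
# vals == ["ok", "error"], ends == [1000, 1002]
```

The memory saving in the highly-repetitive case is exactly this collapse: 1002 dense values become two runs.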
This PR implements direct reads of Parquet RLE (Run Length Encoded) data into Arrow REE (Run End Encoded) representation as described in #32339, using the existing read_dictionary API as inspiration for interface level changes. Like read_dictionary, this feature is currently only supported for columns with a Parquet physical type of BYTE_ARRAY, such as string or binary types.
Example usage:
```python
import pyarrow.parquet as pq

# These columns will be read directly as Arrow run-end-encoded arrays,
# without full materialization, if their Parquet representation was
# run-length-encoded.
table = pq.read_table('data.parquet', read_ree=['category', 'status'])
```
This is a considerably hefty feature, so please let me know if there's anything I can do to help the review process (e.g., by splitting this into multiple smaller PRs). The way I implemented this is by adding a GetNextValueAndNumRepeats API to the RleBitPackedDecoder, which skips materialization of all values for RleRuns while still stepping through BitPackedRuns bit by bit. I'm definitely open to suggestions on other approaches here.
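To illustrate the idea behind that API (a hypothetical pure-Python sketch, not the actual C++ decoder interface): instead of expanding each RLE run into `count` copies of a value, the decoder can yield `(value, num_repeats)` pairs, while bit-packed runs are still walked one value at a time:

```python
def next_value_and_num_repeats(runs):
    """Yield (value, repeat_count) pairs from a mixed RLE/bit-packed stream.

    `runs` is a list of ('rle', value, count) or ('bitpacked', [values])
    tuples -- a simplified stand-in for Parquet's hybrid RLE/bit-packing
    encoding, not the real decoder input.
    """
    for run in runs:
        if run[0] == 'rle':
            # An RLE run maps to a single pair: no materialization needed.
            _, value, count = run
            yield value, count
        else:
            # Bit-packed runs must still be decoded value by value.
            for value in run[1]:
                yield value, 1

pairs = list(next_value_and_num_repeats(
    [('rle', 7, 1000), ('bitpacked', [1, 2, 3])]))
# pairs == [(7, 1000), (1, 1), (2, 1), (3, 1)]
```

A run of 1000 repeated values costs one pair rather than 1000 materialized elements, which is where the speedup on repetitive columns comes from.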
Regarding performance, I am anecdotally observing an order of magnitude (~10x) speedup when reading columns with many repeated values, and a slight performance degradation for columns containing purely unique values when the feature is enabled. (Since the feature is toggled by the user, they would presumably understand the shape of their data well enough to decide whether to enable it.) I haven't done any scientific benchmarking beyond this; let me know if that would be helpful.
What changes are included in this PR?
Add ArrowReaderProperties::set_read_ree() / read_ree() methods to enable REE reading per-column
Implement GetNextValueAndNumRepeats() and GetNextValueAndNumRepeatsSpaced() methods in RleBitPackedDecoder
Add ByteArrayReeRecordReader for decoding Parquet BYTE_ARRAY columns to RunEndEncodedArray
Support REE decoding for both Plain and RLE_DICTIONARY encodings
Add Python bindings via read_ree_columns parameter in ParquetDataset and ParquetFile
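As a rough illustration of why the RLE_DICTIONARY encoding pairs naturally with REE (a simplified pure-Python model, not the actual decoder): dictionary-encoded pages store RLE runs of dictionary indices, so each run of indices can be translated into a single REE run of the looked-up value:

```python
def decode_rle_dictionary_to_ree(dictionary, index_runs):
    """Translate RLE runs of dictionary indices straight into REE runs.

    `index_runs` is a list of (index, count) pairs -- a simplified model
    of Parquet RLE_DICTIONARY page data, not the real on-disk format.
    """
    run_values, run_ends = [], []
    end = 0
    for index, count in index_runs:
        # Each RLE run of indices becomes exactly one REE run.
        end += count
        run_values.append(dictionary[index])
        run_ends.append(end)
    return run_values, run_ends

vals, ends = decode_rle_dictionary_to_ree(
    ["active", "inactive"], [(0, 500), (1, 3)])
# vals == ["active", "inactive"], ends == [500, 503]
```

Plain-encoded pages lack this run structure, which is consistent with the observation above that purely unique columns see no benefit.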
lesterfan changed the title to "GH-32339: [C++][Python][Parquet] Implement direct reads of Parquet RLE encoded data into Arrow REE" on Apr 10, 2025.
Tagging @pitrou and @raulcd for some initial feedback on the implementation. I'm happy to split this into smaller PRs if that's easier for review (though guidance on how to split it would be appreciated), or make design changes based on your input. Tagging you both since we've worked together previously and I see you've reviewed recent REE changes, but feel free to suggest other reviewers if more appropriate.
@pitrou I rebased this PR on current main. Let me know if there's anything else I can do to help review here.
I also tested this again locally with a Parquet file with many repeated values and am anecdotally observing a ~10x speedup in reads using this code path.
Are these changes tested?
Yes, through included C++ unit tests and pytests.
Are there any user-facing changes?
Yes.